We're going to look at NYC restaurant inspection data.

library(tidyverse)
library(httr)
library(jsonlite)

library(p8105.datasets)
library(plotly)

# Download every page of results from the API endpoint at `url`.
# Returns a list with one tibble per page.
get_all_inspections = function(url) {
  
  all_inspections = vector("list", length = 0)
  
  loop_index = 1
  chunk_size = 50000
  DO_NEXT = TRUE
  
  while (DO_NEXT) {
    message("Getting data, page ", loop_index)
    
    all_inspections[[loop_index]] = 
      GET(url,
          query = list(`$order` = "zipcode",
                       `$limit` = chunk_size,
                       # as.integer() keeps large offsets out of scientific notation
                       `$offset` = as.integer((loop_index - 1) * chunk_size)
                       )
          ) %>%
      content("text") %>%
      fromJSON() %>%
      as_tibble()
    
    # A full page suggests more rows remain; a short page is the last one
    DO_NEXT = nrow(all_inspections[[loop_index]]) == chunk_size
    loop_index = loop_index + 1
  }
  
  all_inspections
  
}

url = "https://data.cityofnewyork.us/resource/43nn-pn8j.json"

nyc_inspections = 
  get_all_inspections(url) %>%
  bind_rows() 
## Getting data, page 1
## Getting data, page 2
## Getting data, page 3
## Getting data, page 4
## Getting data, page 5
## Getting data, page 6
## Getting data, page 7
## Getting data, page 8
## Getting data, page 9

The following dataset lists restaurants in Manhattan whose violations fall into three categories: general violations, critical violations, and public health hazards. We keep only the restaurants with grades A, B, and C, and exclude all other grade values, including Z, N, P, and N/A.

nyc_inspections_df = 
  nyc_inspections %>% 
  select(boro, cuisine_description, inspection_date, violation_code, score, grade) %>% 
  filter(
    grade %in% c("A", "B", "C"),
    boro == "Manhattan") %>% 
  drop_na(grade)
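
Values parsed from a JSON response can arrive as text rather than numbers, so it may be worth checking the column types before plotting. The sketch below uses a small made-up tibble (`toy_inspections` and its values are hypothetical, not rows from the API) to apply the same filter and then convert `score` to numeric. It also illustrates that `%in%` already drops missing grades.

```r
library(dplyr)

# Hypothetical rows standing in for the API result; the values are made up.
toy_inspections = tibble(
  boro  = c("Manhattan", "Manhattan", "Bronx", "Manhattan", "Manhattan", "Manhattan"),
  score = c("7", "21", "12", "35", "40", "9"),
  grade = c("A", "B", "A", "C", "Z", NA)
)

toy_clean = 
  toy_inspections %>% 
  # NA %in% c("A", "B", "C") is FALSE, so missing grades are dropped here too
  filter(grade %in% c("A", "B", "C"), boro == "Manhattan") %>% 
  # convert text scores to numbers so plots treat them as continuous
  mutate(score = as.numeric(score))

toy_clean
```

If `score` is already numeric in the real data, the conversion leaves it unchanged.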

The following box plot shows the distribution of scores within each grade. Based on the plot, grade A corresponds to scores of 0-13, grade B to scores of 14-27, and grade C to scores greater than 27.

nyc_inspections_df %>% 
  plot_ly(
    x = ~grade, y = ~score, color = ~grade,
    type = "box", colors = "viridis")
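
The score ranges read off the box plot can be written down as a small lookup. This is a minimal sketch, assuming the cut points described above (0-13 for A, 14-27 for B, above 27 for C); `score_to_grade` is a hypothetical helper and not part of the original analysis.

```r
# Hypothetical helper: map a numeric score to the grade implied by the
# ranges described above. cut() uses right-closed intervals by default,
# so the bins are (-Inf, 13], (13, 27], and (27, Inf).
score_to_grade = function(score) {
  as.character(
    cut(score, breaks = c(-Inf, 13, 27, Inf), labels = c("A", "B", "C"))
  )
}

implied_grades = score_to_grade(c(0, 13, 14, 27, 28, 60))
implied_grades
```

Comparing the implied grade with the recorded `grade` column would be one way to check how closely the data follow these ranges.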

The bar plot below shows how often each violation code appears among restaurants in Manhattan. 10F is the most common code. It covers: non-food contact surface improperly constructed; unacceptable material used; non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.

nyc_inspections_df %>%
  count(violation_code) %>% 
  mutate(violation_code = fct_reorder(violation_code, n)) %>% 
  plot_ly(
    x = ~n, y = ~violation_code, color = ~violation_code,
    type = "bar", colors = "viridis")
## Warning: Ignoring 1 observations
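
The warning above is probably caused by a row with a missing `violation_code`, though that is an assumption rather than something the output confirms. The sketch below uses made-up codes (`toy_violations` is hypothetical) to drop missing codes before counting and list the most frequent ones.

```r
library(dplyr)

# Hypothetical violation codes; the counts are made up for illustration.
toy_violations = tibble(
  violation_code = c("10F", "08A", "10F", "02G", "10F", "08A", NA)
)

top_codes = 
  toy_violations %>% 
  # remove rows without a code so they are not counted as their own group
  filter(!is.na(violation_code)) %>% 
  # sort = TRUE orders the result from most to least frequent
  count(violation_code, sort = TRUE) %>% 
  head(2)

top_codes
```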

The following plot shows the density of scores within each grade.

score_distribution = 
  nyc_inspections_df %>% 
  ggplot(aes(x = score, fill = grade)) + 
  geom_density(alpha = .4, adjust = .5, color = "blue")

ggplotly(score_distribution)
## Warning: Groups with fewer than two data points have been dropped.
